data environment
A Decentralized Retrieval Augmented Generation System with Source Reliabilities Secured on Blockchain
Lu, Yining, Tang, Wenyi, Johnson, Max, Jung, Taeho, Jiang, Meng
Existing retrieval-augmented generation (RAG) systems typically use a centralized architecture, which incurs high costs for data collection, integration, and management, as well as privacy concerns. There is a great need for a decentralized RAG system that enables foundation models to use information directly from data owners who retain full control over their sources. However, decentralization brings a challenge: the many independent data sources vary significantly in reliability, which can diminish retrieval accuracy and response quality. To address this, our decentralized RAG system includes a novel reliability scoring mechanism that dynamically evaluates each source based on the quality of the responses it helps generate and prioritizes high-quality sources during retrieval. To ensure transparency and trust, the scoring process is securely managed through blockchain-based smart contracts, creating verifiable and tamper-proof reliability records without relying on a central authority. We evaluate our decentralized system with two Llama models (3B and 8B) in two simulated environments where six data sources have different levels of reliability. Our system achieves a +10.7% performance improvement over its centralized counterpart in realistic, unreliable data environments. Notably, it approaches the upper-bound performance of centralized systems under ideally reliable data environments. The decentralized infrastructure enables secure and trustworthy scoring management, achieving approximately 56% marginal cost savings through batched update operations. Our code and system are open-sourced at github.com/yining610/Reliable-dRAG.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (2 more...)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)
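The reliability-scoring idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the score update rule (an exponential moving average) and the similarity-times-reliability ranking are assumptions chosen to make the mechanism concrete.

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    reliability: float = 0.5  # neutral prior before any feedback

class ReliabilityIndex:
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha  # smoothing factor for the moving-average update
        self.sources: dict[str, Source] = {}

    def add_source(self, name: str) -> None:
        self.sources[name] = Source(name)

    def update(self, name: str, response_quality: float) -> None:
        # Move the score toward the observed quality of the response
        # this source contributed to (quality in [0, 1]).
        s = self.sources[name]
        s.reliability = (1 - self.alpha) * s.reliability + self.alpha * response_quality

    def rank(self, candidates: list[tuple[str, float]]) -> list[tuple[str, float]]:
        # candidates: (source_name, similarity); reliable sources are
        # prioritized by scaling similarity with the source's score.
        return sorted(
            ((name, sim * self.sources[name].reliability) for name, sim in candidates),
            key=lambda pair: pair[1],
            reverse=True,
        )

index = ReliabilityIndex()
for name in ("A", "B"):
    index.add_source(name)
index.update("A", 1.0)   # source A contributed to a good answer
index.update("B", 0.0)   # source B contributed to a poor one
ranked = index.rank([("A", 0.8), ("B", 0.9)])  # A now outranks B
```

In the paper, the score updates themselves are recorded via smart contracts rather than a local dictionary; the blockchain layer is omitted here.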
Local Performance vs. Out-of-Distribution Generalization: An Empirical Analysis of Personalized Federated Learning in Heterogeneous Data Environments
Hussaini, Mortesa, Theiß, Jan, Stein, Anthony
In Federated Learning with heterogeneous data environments, local models tend to converge to their own local optima during local training, deviating from the overall data distribution. Aggregating these local updates, e.g., with FedAvg, often does not align with the global optimum (client drift), resulting in an update that is suboptimal for most clients. Personalized Federated Learning approaches address this challenge by focusing exclusively on the average local performance of clients' models on their own data distributions. Generalization to out-of-distribution samples, which is a substantial benefit of FedAvg and a significant component of robustness, appears to be inadequately incorporated into assessment and evaluation. This study thoroughly evaluates Federated Learning approaches on both their local performance and their generalization capabilities. To this end, we examine different stages within a single communication round to enable a more nuanced understanding of the considered metrics. Furthermore, we propose and incorporate a modified FedAvg, designated Federated Learning with Individualized Updates (FLIU), which extends the algorithm with a straightforward individualization step using an adaptive personalization factor. We evaluate and compare the approaches empirically on MNIST and CIFAR-10 under various distributional conditions, including benchmark IID and pathological non-IID settings, as well as additional novel test environments with Dirichlet distributions specifically developed to stress the algorithms on complex data heterogeneity.
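The individualization step described for FLIU can be sketched as a blend of the aggregated global model and each client's local model. This is a simplified sketch under assumptions: the function names and the fixed personalization factor below are illustrative, not the paper's exact adaptive rule.

```python
def fedavg(client_weights: list[list[float]]) -> list[float]:
    # Plain FedAvg: coordinate-wise mean of the clients' model weights.
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]

def individualized_update(local: list[float], global_w: list[float],
                          alpha: float) -> list[float]:
    # alpha = personalization factor: 0 keeps the global model,
    # 1 keeps the purely local model; FLIU adapts this per client.
    return [alpha * l + (1 - alpha) * g for l, g in zip(local, global_w)]

clients = [[1.0, 2.0], [3.0, 4.0]]
g = fedavg(clients)                                    # averaged global model
w0 = individualized_update(clients[0], g, alpha=0.5)   # client 0's blend
```

The interesting design question the paper studies is how this blend trades local accuracy against out-of-distribution generalization: larger alpha favors the former at the expense of the latter.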
Policy-Driven AI in Dataspaces: Taxonomy, Explainability, and Pathways for Compliant Innovation
Chandra, Joydeep, Navneet, Satyam Kumar
As AI-driven dataspaces become integral to data sharing and collaborative analytics, ensuring privacy, performance, and policy compliance presents significant challenges. This paper provides a comprehensive review of privacy-preserving and policy-aware AI techniques, including Federated Learning, Differential Privacy, Trusted Execution Environments, Homomorphic Encryption, and Secure Multi-Party Computation, alongside strategies for aligning AI with regulatory frameworks such as GDPR and the EU AI Act. We propose a novel taxonomy to classify these techniques based on privacy levels, performance impacts, and compliance complexity, offering a clear framework for practitioners and researchers to navigate trade-offs. Key performance metrics -- latency, throughput, cost overhead, model utility, fairness, and explainability -- are analyzed to highlight the multi-dimensional optimization required in dataspaces. The paper identifies critical research gaps, including the lack of standardized privacy-performance KPIs, challenges in explainable AI for federated ecosystems, and semantic policy enforcement amidst regulatory fragmentation. Future directions are outlined, proposing a conceptual framework for policy-driven alignment, automated compliance validation, standardized benchmarking, and integration with European initiatives like GAIA-X, IDS, and Eclipse EDC. By synthesizing technical, ethical, and regulatory perspectives, this work lays the groundwork for developing trustworthy, efficient, and compliant AI systems in dataspaces, fostering innovation in secure and responsible data-driven ecosystems.
- North America > United States > Florida (0.04)
- Europe > Switzerland (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- (3 more...)
- Research Report (1.00)
- Overview (1.00)
- Law > Statutes (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Government (1.00)
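One technique from the taxonomy above can be made concrete: the Laplace mechanism for differential privacy applied to a counting query. This is a generic textbook sketch, not taken from the paper; the epsilon value and the query are illustrative choices.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of a Laplace(0, scale) random variate.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records: list[bool], epsilon: float) -> float:
    # A count query has L1 sensitivity 1, so the noise scale is 1/epsilon.
    true_count = sum(records)
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)  # fixed seed so the sketch is reproducible
noisy = private_count([True] * 40 + [False] * 60, epsilon=1.0)
```

The privacy/performance trade-off the taxonomy classifies shows up directly here: a smaller epsilon means stronger privacy but a noisier, less useful count.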
Context-Aware Rule Mining Using a Dynamic Transformer-Based Framework
Liu, Jie, Zhang, Yiwei, Sheng, Yuan, Lou, Yujia, Wang, Haige, Yang, Bohuan
This study proposes a dynamic rule mining algorithm based on an improved Transformer architecture, aiming to improve the accuracy and efficiency of rule mining in dynamic data environments. As data volume and complexity increase, traditional data mining methods struggle to cope with dynamic data that has strong temporal and variable characteristics, so new algorithms are needed to capture temporal regularities in the data. By improving the Transformer architecture and introducing a dynamic weight adjustment mechanism and a temporal dependency module, we enable the model to adapt to data changes and mine more accurate rules. Experimental results show that, compared with traditional rule mining algorithms, the improved Transformer model achieves significant improvements in rule mining accuracy, coverage, and stability. The contribution of each module to the algorithm's performance is further verified by ablation experiments, demonstrating the importance of the temporal dependency and dynamic weight adjustment mechanisms. In addition, although the improved model faces certain challenges in computational efficiency, its advantages in accuracy and coverage allow it to perform well on complex dynamic data. Future research will focus on optimizing computational efficiency and incorporating more deep learning techniques to expand the algorithm's scope of application, especially in practical domains such as finance, medical care, and intelligent recommendation.
- North America > United States > California (0.28)
- North America > United States > Minnesota (0.14)
- Asia (0.14)
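The paper's model is a full Transformer; as a drastically simplified stand-in, the sketch below shows only the temporal-weighting intuition behind "dynamic weight adjustment": a rule's support is discounted by a recency decay so that newer transactions count more. All names and the decay form here are assumptions for illustration.

```python
import math

def weighted_support(transactions: list[tuple[int, frozenset]],
                     itemset: frozenset, now: int, decay: float = 0.1) -> float:
    # Each transaction is (timestamp, items); its weight is exp(-decay * age),
    # so support drifts toward what recent data exhibits.
    total, hit = 0.0, 0.0
    for ts, items in transactions:
        w = math.exp(-decay * (now - ts))
        total += w
        if itemset <= items:
            hit += w
    return hit / total if total else 0.0

txns = [
    (0, frozenset({"a", "b"})),   # old
    (5, frozenset({"a"})),        # middle-aged, no "b"
    (9, frozenset({"a", "b"})),   # recent
]
s = weighted_support(txns, frozenset({"a", "b"}), now=10)
# s exceeds the unweighted support of 2/3 because the recent hit weighs more
```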
A Structured Reasoning Framework for Unbalanced Data Classification Using Probabilistic Models
Du, Junliang, Dou, Shiyu, Yang, Bohuan, Hu, Jiacheng, An, Tai
This paper studies a Markov network model for unbalanced data, aiming to address the classification bias and poor minority-class recognition of traditional machine learning models in environments with uneven class distributions. By constructing a joint probability distribution and conditional dependencies, the model achieves global modeling and reasoning-based optimization over sample categories. The study introduces marginal probability estimation and weighted-loss optimization strategies, combined with regularization constraints and structured reasoning methods, effectively improving the model's generalization ability and robustness. In the experiments, a real credit card fraud detection dataset was used for comparison against models such as logistic regression, support vector machines, random forests, and XGBoost. The results show that the Markov network performs well on metrics such as weighted accuracy, F1 score, and AUC-ROC, significantly outperforming traditional classifiers and demonstrating strong decision-making ability and applicability in unbalanced data scenarios. Future research can focus on efficient model training, structural optimization, and deep learning integration for large-scale unbalanced data environments, and promote wide application in areas such as financial risk control, medical diagnosis, and intelligent monitoring.
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > Switzerland (0.04)
- Banking & Finance (1.00)
- Health & Medicine (0.89)
- Law Enforcement & Public Safety > Fraud (0.69)
- Information Technology > Security & Privacy (0.48)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.77)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.55)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
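The weighted-loss strategy mentioned in the abstract can be sketched minimally: minority-class errors are penalized more via weights inversely proportional to class frequency. The weighting rule below is a common convention, not necessarily the paper's exact choice, and the model itself (the Markov network) is omitted.

```python
import math
from collections import Counter

def class_weights(labels: list[int]) -> dict[int, float]:
    # w_c = N / (num_classes * count_c): rare classes get larger weights.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_log_loss(labels: list[int], probs: list[float],
                      weights: dict[int, float]) -> float:
    # Binary cross-entropy where each sample is scaled by its class weight,
    # so a missed minority (fraud) case costs far more than a missed majority case.
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # clamp to avoid log(0)
        ll = math.log(p) if y == 1 else math.log(1 - p)
        total += -weights[y] * ll
    return total / len(labels)

labels = [0] * 9 + [1]              # imbalanced: 9 negatives, 1 positive
w = class_weights(labels)            # positive class weighted 9x the negative
loss = weighted_log_loss(labels, [0.1] * 9 + [0.6], w)
```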
A Collaborative Multi-Agent Approach to Retrieval-Augmented Generation Across Diverse Data
Salve, Aniruddha, Attar, Saba, Deshmukh, Mahesh, Shivpuje, Sayali, Utsab, Arnab Mitra
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external, domain-specific data into the generative process. While LLMs are highly capable, they often rely on static, pre-trained datasets, limiting their ability to integrate dynamic or private data. Traditional RAG systems typically use a single-agent architecture to handle query generation, data retrieval, and response synthesis. However, this approach becomes inefficient when dealing with diverse data sources, such as relational databases, document stores, and graph databases, often leading to performance bottlenecks and reduced accuracy. This paper proposes a multi-agent RAG system to address these limitations. Specialized agents, each optimized for a specific data source, handle query generation for relational, NoSQL, and document-based systems. These agents collaborate within a modular framework, with query execution delegated to an environment designed for compatibility across various database types. This distributed approach enhances query efficiency, reduces token overhead, and improves response accuracy by ensuring that each agent focuses on its specialized task. The proposed system is scalable and adaptable, making it ideal for generative AI workflows that require integration with diverse, dynamic, or private data sources. By leveraging specialized agents and a modular execution environment, the system provides an efficient and robust solution for handling complex, heterogeneous data environments in generative AI applications.
- Asia > India > Maharashtra > Pune (0.05)
- Asia > Malaysia > Kuala Lumpur > Kuala Lumpur (0.04)
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
- Research Report (0.64)
- Overview (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.56)
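The coordinator-and-specialists pattern described above can be sketched as a simple dispatch table. The agent names and stubbed outputs below are illustrative assumptions; in the real system each agent would call an LLM to produce a source-specific query.

```python
from typing import Callable

def sql_agent(question: str) -> str:
    # Would prompt an LLM to emit SQL for a relational store; stubbed here.
    return f"SQL for: {question}"

def document_agent(question: str) -> str:
    return f"vector search for: {question}"

def graph_agent(question: str) -> str:
    return f"graph traversal for: {question}"

AGENTS: dict[str, Callable[[str], str]] = {
    "relational": sql_agent,
    "document": document_agent,
    "graph": graph_agent,
}

def route(question: str, source_type: str) -> str:
    # The coordinator delegates each query to the agent specialized
    # for its data source, keeping every agent's prompt and context small.
    agent = AGENTS.get(source_type)
    if agent is None:
        raise ValueError(f"no agent registered for source type {source_type!r}")
    return agent(question)

plan = route("top customers by revenue", "relational")
```

Keeping each agent narrowly scoped is what yields the token-overhead and accuracy benefits the abstract claims: no single prompt has to describe every schema and database dialect at once.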
A Theoretical Framework for AI-Driven Data Quality Monitoring in High-Volume Data Environments
Bangad, Nikhil, Jayaram, Vivekananda, Krishnappa, Manjunatha Sughaturu, Banarse, Amey Ram, Bidkar, Darshan Mohan, Nagpal, Akshay, Parlapalli, Vidyasagar
This paper presents a theoretical framework for an AI-driven data quality monitoring system designed to address the challenges of maintaining data quality in high-volume environments. We examine the limitations of traditional methods in managing the scale, velocity, and variety of big data and propose a conceptual approach leveraging advanced machine learning techniques. Our framework outlines a system architecture that incorporates anomaly detection, classification, and predictive analytics for real-time, scalable data quality management. Key components include an intelligent data ingestion layer, adaptive preprocessing mechanisms, context-aware feature extraction, and AI-based quality assessment modules. A continuous learning paradigm is central to our framework, ensuring adaptability to evolving data patterns and quality requirements. We also address implications for scalability, privacy, and integration within existing data ecosystems. While practical results are not provided, the framework lays a robust theoretical foundation for future research and implementation, advancing data quality management and encouraging the exploration of AI-driven solutions in dynamic environments.
- North America > United States > Texas (0.04)
- North America > United States > California (0.04)
- Oceania > New Zealand > North Island > Waikato (0.04)
- (2 more...)
- Overview (1.00)
- Research Report (0.64)
- Information Technology > Security & Privacy (1.00)
- Education (0.87)
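One component the framework names, anomaly detection for data quality, can be sketched concretely: flag a batch whose null rate deviates sharply from the history seen so far. The z-score rule, threshold, and metric choice below are assumptions for illustration, not prescriptions from the paper.

```python
import math

def zscore_flag(history: list[float], value: float,
                threshold: float = 3.0) -> bool:
    # Flag `value` if it lies more than `threshold` standard deviations
    # from the mean of the historical values.
    n = len(history)
    mean = sum(history) / n
    var = sum((x - mean) ** 2 for x in history) / n
    std = math.sqrt(var) or 1e-12  # guard against a perfectly flat history
    return abs(value - mean) / std > threshold

null_rates = [0.01, 0.012, 0.011, 0.009, 0.010, 0.011]  # per-batch null rates
ok = zscore_flag(null_rates, 0.011)    # typical batch, not flagged
bad = zscore_flag(null_rates, 0.25)    # null-rate spike, flagged
```

In the framework's continuous-learning setting, the history window itself would be updated as new batches arrive, letting the detector adapt to gradually evolving data patterns.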
Council Post: The Broken Promise Of AI: What Went Wrong Between 2012 And 2022
Lewis Wynne-Jones is the Vice President of Product at ThinkData Works. In 2012, two things happened that set the tone for the next decade of data investments. One was technological, the other was professional, and they both revolutionized the way we think about data. Together, these events directly led to the emergence of artificial intelligence as a business prerogative. Today, however, AI is fraught with problems, and fewer businesses, not more, are saying they're data-driven.
Why Data Cleaning Is Failing Your ML Models - And What To Do About It
Precise endeavors must be done to exacting standards in clean environments. Surgeons scrub in, rocket scientists work in clean rooms, and data scientists…well we try our best. We've all heard the platitude, "garbage in, garbage out," so we spend most of our time doing the most tedious part of the job: data cleaning. Unfortunately, no matter how hard we scrub, poor data quality is often too pervasive and invasive for a quick shower. Our research across the data stacks of more than 150 organizations shows an average of 70 impactful data incidents a year for every 1,000 tables in an environment.
AI Algorithms Could Rapidly Deploy to the Battlefield Under New Initiative
The Pentagon's Joint Artificial Intelligence Center recently started building a joint operating system and integration layer that combatant commands and other military components could eventually use to rapidly make and field artificial intelligence algorithms. This work is one key piece of the center's new Artificial Intelligence and Data Accelerator, or AIDA, JAIC Director Lt. Gen. Michael Groen confirmed this week during the NDIA 2022 Expeditionary Warfare Conference. "AIDA brings us, in small teams, out to the combatant commanders--now, for those of you who have been in combatant commands, or you're familiar with that environment--combat commanders have all of the challenges, all the problems and only some capability, right? And so what we're trying to do from an information advantage perspective is bring them the advantages of good data and good artificial intelligence-generating insights," he explained. Launched last year by Deputy Defense Secretary Kathleen Hicks, AIDA marks a broad initiative to boost data-based decision-making across the military's 11 combatant commands.
- North America > United States (0.39)
- Asia > China (0.39)